Prediction of compound-target interaction using several artificial intelligence algorithms and comparison with a consensus-based strategy

Predicting the possible protein targets of a chemical compound is crucial for understanding its mechanism of action and side effects, as well as for drug discovery. This study examines 15 target-centric models (TCM) developed with different molecular descriptions and machine learning algorithms. They were compared with 17 third-party models implemented as web tools (WTCM). In both sets of models, consensus strategies were implemented as a potential improvement over individual predictions. The findings indicate that TCM reach f1-score values greater than 0.8. Comparing both approaches, the best TCM achieves values of 0.75, 0.61, 0.25 and 0.38 for true positive/negative rates (TPR, TNR) and false negative/positive rates (FNR, FPR), outperforming the best WTCM. Moreover, the consensus strategy yields its most relevant results in the top 20% of target profiles. The TCM consensus reaches TPR and FNR values of 0.98 and 0, while the WTCM consensus reaches values of 0.75 and 0.24. The computational tool implementing the TCM and their consensus strategy is available at: https://bioquimio.udla.edu.ec/tidentification01/. Scientific Contribution: We compare and discuss the performance of 17 public compound-target interaction prediction models and 15 newly built ones. We also explore a compound-target interaction prioritization strategy using a consensus approach, and we analyze the challenges involved in interaction modeling. Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-024-00816-1.

All raw data from the three database releases (27, 28, and 31) were curated to ensure high quality for the experiments. After that, a total of 350818 compounds, 1521 targets, and 507553 compound-protein interactions common to releases 27 and 28 were taken for training target-centric models (TCM). The remaining compound-target associations, found in release 31 but not in releases 27 and 28, were reserved for external validation (52874 compounds, 1196 targets, and 74987 molecule-target associations). Additionally, a minimum of 5 active/inactive compounds per target was required in both datasets to keep the data consistent. Therefore, the dataset for training TCM was reduced to DS1 (253 targets, 184046 compounds, and 249269 interactions). This also reduced the dataset for external validation, which was labeled VSD2 (253 targets, 30526 compounds, and 42382 interactions). The full data curation process is summarized in Figure SM1.1. Moreover, to evaluate the target-centric models from the web tools (WTCM) and to compare them with TCM and their consensus strategy, the dataset VSD3 was created. Although the initial idea was to use VSD2, it was not possible to run predictions for all VSD2 compounds with the WTCM because of several external factors affecting the web tools (unavailability of service, limited information, and slow response times, among others). Therefore, only a sample of 3264 molecules was used.
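The release-based split described above can be sketched with plain set operations. This is a minimal illustration, not the authors' curation pipeline: the record layout `(compound, target, label)`, the function names, and the reading of the minimum-count rule as "at least 5 active and at least 5 inactive compounds per target" are assumptions made for this example.

```python
from collections import defaultdict

def split_by_release(r27, r28, r31):
    """Each argument is a set of (compound, target, label) records (hypothetical layout)."""
    training = r27 & r28               # records common to releases 27 and 28
    validation = r31 - (r27 | r28)     # records found only in release 31
    return training, validation

def targets_with_enough_data(records, min_per_class=5):
    """Targets with at least `min_per_class` active (1) and inactive (0) compounds."""
    counts = defaultdict(lambda: {1: 0, 0: 0})
    for _compound, target, label in records:
        counts[target][label] += 1
    return {t for t, c in counts.items()
            if c[1] >= min_per_class and c[0] >= min_per_class}

def keep_shared_targets(training, validation, min_per_class=5):
    """Keep only targets that pass the minimum-count rule in both datasets."""
    shared = (targets_with_enough_data(training, min_per_class)
              & targets_with_enough_data(validation, min_per_class))
    keep = lambda records: {r for r in records if r[1] in shared}
    return keep(training), keep(validation)
```

Applying `keep_shared_targets` to the two sets would correspond, under these assumptions, to the reduction from the full training and validation sets to DS1 and VSD2.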

Note SM1.2. Target-centric Models (TCM)
For each target in DS1, regardless of its total number of positive and negative interactions with compounds, individual target-centric models (TCM) were trained using three different molecular representations as features: 1) a 1024-bit Morgan fingerprint with a radius of eight (FGP); 2) a set of 123 general physicochemical, structural, and topological molecular properties (DSC) (see Table SM1.1); and 3) the union of the FGP and DSC descriptors (FUS). The FGP is a set of binary digits (bits) that represent the presence or absence of structural features in the molecule, including atom environments [3]. All molecular descriptors and fingerprints used in this study were computed with the open-source cheminformatics software RDKit through its Python implementation [4].
The number of valence electrons the molecule has.

TPSA
The polar surface area of a molecule based upon fragments.

RotatableBonds
The number of rotatable bonds.

LabuteASA
The Labute ASA value for a molecule.

AmideBonds
The number of amide bonds in a molecule.

Heteroatoms
The number of heteroatoms for a molecule.

SpiroAtoms
The number of spiro atoms (atoms shared between rings that share exactly one atom).

BridgeheadAtoms
The number of bridgehead atoms (atoms shared between rings that share at least two bonds).

FractionCSP3
The fraction of C atoms that are SP3 hybridized.

AromaticRings
The number of aromatic rings for a molecule.

SaturatedRings
The number of saturated rings for a molecule.

AliphaticRings
The number of aliphatic (containing at least one non-aromatic bond) rings for a molecule.

AromaticCarbocycles
The number of aromatic carbocycles for a molecule.

SaturatedCarbocycles
The number of saturated carbocycles for a molecule.

AliphaticCarbocycles
The number of aliphatic (containing at least one non-aromatic bond) carbocycles for a molecule.

AromaticHeterocycles
The number of aromatic heterocycles for a molecule.

SaturatedHeterocycles
The number of saturated heterocycles for a molecule.

AliphaticHeterocycles
The number of aliphatic (containing at least one non-aromatic bond) heterocycles for a molecule.

NumRings
The number of rings for a molecule.

CalcAUTOCORR2D
Returns 2D autocorrelation descriptors vector.

BertzCT
A topological index meant to quantify "complexity" of molecules.

Ipc
The information content of the coefficients of the characteristic polynomial of a hydrogen-suppressed graph of a molecule.

HallKierAlpha
The Hall-Kier alpha value for a molecule.

A binned form of either the partial charge with a VSA.
A binned form of either MolMR
A binned form of either MolLogP.

Since target identification was framed as a classification problem, and since for most targets a compound provides limited information about complex interactions, a TCM was built for each target with common ML algorithms rather than more sophisticated ones such as neural networks. The models were trained with the following five popular ML algorithms:
Decision Tree (DT): uses a set of rules to make decisions, much like humans do [5]. It uses the features to create yes/no questions and splits the dataset until the compounds are isolated into a particular class.
Random Forest (RF): employs averaging to increase predictive accuracy and reduce overfitting by fitting several decision tree classifiers to different subsamples of the data set [6]. This algorithm considers only a selected subset of features when splitting nodes [7].
K-nearest neighbors (KNN): categorizes compounds based on the training instances that are nearest to them in feature space [8]. A compound is assigned to the class that is most common among its q closest neighbors. The distance between compounds is measured by the Minkowski distance, and the value of q is set to three.
Support Vector Machine (SVM): identifies the hyperplane dividing the various classes and maps the decision boundary for each class [9]. This approach is based on a kernel function, a mapping applied to the training set so that it more closely resembles data that can be separated. The non-linear Radial Basis Function (RBF) kernel is used in this implementation. Linearly separable data are uncommon in real-world situations; hence, the RBF kernel can find patterns that a linear method such as logistic regression would miss [10].
Gaussian Naive Bayes (GM): applies Bayes' theorem under the assumption of independence between each pair of features [11] and computes the probability that a given compound belongs to a given class. It belongs to a family of basic probabilistic classification techniques.
A total of 15 TCM (five ML algorithms times three molecular representations) were computed for each target, without any feature filtering or selection, using the available compound-target interaction data. Thirty percent of the data was held out for TCM evaluation (in addition to the VSD2 external validation data).
The applicability domain (AD) of each model was also computed. The AD refers to the region in space where the "normal" objects (compounds) are located [12]; it delimits the chemical region outside of which predictions cannot be regarded as reliable [12,13]. The AD was defined using a distance-based method (see Figure SM1.2), with the Hamming distance for FGP descriptors and the Euclidean distance for DSC descriptors. For the FUS descriptors, the Euclidean and Hamming distances were applied simultaneously. For this procedure, the center of the training set was calculated using the standardized descriptors, and the maximum distance between the center and the training set compounds was taken as the threshold [14]. If the distance between a query compound and the center was larger than this threshold, the query compound was considered outside the AD.
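The distance-based AD can be illustrated with a short sketch. It covers only the Euclidean case used for DSC descriptors; the function names, the dictionary layout, and the use of population standard deviation for standardization are assumptions for this example, not details taken from the study.

```python
import math

def fit_domain(train):
    """Fit a distance-based AD on rows of numeric descriptors (Euclidean case only)."""
    n, d = len(train), len(train[0])
    means = [sum(row[j] for row in train) / n for j in range(d)]
    # Population standard deviation; constant columns fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in train) / n) or 1.0
            for j in range(d)]
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in train]
    center = [sum(row[j] for row in scaled) / n for j in range(d)]
    # The largest center-to-compound distance in the training set is the threshold.
    threshold = max(math.dist(row, center) for row in scaled)
    return {"means": means, "stds": stds, "center": center, "threshold": threshold}

def in_domain(query, domain):
    """True if the standardized query lies within the threshold distance of the center."""
    scaled = [(q - m) / s for q, m, s in zip(query, domain["means"], domain["stds"])]
    return math.dist(scaled, domain["center"]) <= domain["threshold"]
```

For bit-vector fingerprints, the same structure would apply with a Hamming distance in place of `math.dist`; how the two distances were combined for FUS is not specified here, so it is left out of the sketch.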
The AD was established and validated for each compound in the training set, and it was then used to screen the compounds of VSD2 before applying the TCM. Then, the confusion matrix was computed to measure the performance of the TCM. A confusion matrix is a widely used table that displays how well a classification model performs on a set of test data [15]. It reports the proportions of the four classification outcomes: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). TP are interactions predicted as positive that are active according to the ChEMBL information; TN are interactions predicted as negative that are inactive according to ChEMBL; FP are interactions that are inactive according to ChEMBL but were falsely predicted as positive; and FN are interactions that are active according to ChEMBL but were falsely predicted as negative. These outcomes are used to calculate other metrics, such as the f1-score, that offer more insightful analyses of the predictions.
The results of the TCM evaluation in this section are reported in terms of the f1-score. The f1-score reaches its highest value at 1, indicating perfect precision and recall, and its lowest possible value at 0.
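As a minimal sketch of how the f1-score follows from the confusion matrix, the snippet below counts the four outcomes from binary labels (1 = active, 0 = inactive) and applies the standard formula F1 = 2TP / (2TP + FP + FN). The function names are illustrative and are not taken from the study's code.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; defined as 0.0 when there are no positives at all."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

Note that true negatives do not enter the f1-score, which is why it is often preferred when inactive interactions dominate the data.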

Note SM1.3. Target-centric Models from Web Tools (WTCM)
A collection of 17 publicly accessible WTCM (Table SM1.2), available as web tools, was used for benchmarking. These models were taken from five websites [16] [17] [18] [19].
In [20] and [16], the MolTarPred web tool publishes the MTP model, which uses a similarity technique to identify a target profile of size 10 for small organic compounds. Similarity calculations for each compound are computed using the Morgan fingerprint. It takes into account a reliability score for the target profile, which leads to higher prospective hit rates and where values less than three indicate negative activity.
In [17], the SwissTargetPrediction web tool publishes the STP model, which predicts around 100 protein targets for a small molecule based on a similarity principle through reverse screening. It uses a probability score for rating the target profile, which also encompasses several species.
In [18], the TargetNet web tool publishes six models based on different compound descriptors: ECFP6 (TS-ECFP6), FP2 (TS-FP2), Daylight (TS-Dl), MACCS (TS-MACCS), ECFP2 (TS-ECFP2) and ECFP4 (TS-ECFP4). The prediction models were constructed using the naïve Bayes algorithm and a variety of molecular descriptors. To enhance the prediction capacity, ensemble learning from these fingerprints was also applied, giving a total of 623 human targets in its target profile.
In [21], the Sea Bkslab web tool publishes the SB model, which uses a similarity method together with a statistical model developed to rank the significance of the predictions. Its authors also confirmed their predictions experimentally, and the tool reports a p-value score in a target profile of size 30. Scores with a p-value under 0.05 represent an active interaction.
In [19], the PPB2 web tool publishes eight models.
Compounds in VSD3 were submitted to the WTCM algorithms using scraping strategies (Table SM1.2). Then, to evaluate the performance of these algorithms, the confusion matrix was also computed to determine the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR). TPR estimates the fraction of positive interactions that were identified as positive interactions. TNR estimates the fraction of negative interactions that were identified as negative interactions. FPR estimates the fraction of negative interactions identified as positive interactions. FNR estimates the fraction of positive interactions identified as negative interactions. These metrics were used to compare the WTCM with the models trained in the previous section (15 TCM).
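The four rates follow directly from the confusion-matrix counts. The sketch below is illustrative only; the function name and the choice to return 0.0 when a class is empty are assumptions made for this example.

```python
def rates(tp, fp, tn, fn):
    """Return TPR, FNR, TNR, and FPR from confusion-matrix counts."""
    positives = tp + fn   # interactions that are truly active
    negatives = tn + fp   # interactions that are truly inactive
    return {
        "TPR": tp / positives if positives else 0.0,
        "FNR": fn / positives if positives else 0.0,
        "TNR": tn / negatives if negatives else 0.0,
        "FPR": fp / negatives if negatives else 0.0,
    }
```

By construction TPR + FNR = 1 and TNR + FPR = 1 whenever the corresponding class is non-empty, so reporting all four is redundant but makes comparisons across tools easier to read.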
Additionally, the recovery rate and unknown rate metrics were defined for each compound in VSD3. The recovery rate for a compound measures the fraction of all targets present in the ChEMBL database for which a prediction can be made by a specific method. That is, no prediction can be made for a target if no model has been obtained for it. If a predicted compound-target association was outside a model's AD, this prediction was excluded when computing the recovery rate. This metric is defined by Equation 1, where $T_{ChEMBL}$ represents the number of targets present in the ChEMBL database. For a particular method (compound representation + ML algorithm), the overall recovery rate was computed as the mean of the metric across all compounds.

$$\text{recovery} = \frac{\text{number of predicted targets}}{T_{ChEMBL}} \quad (1)$$

On the other hand, the unknown rate represents the proportion of interactions predicted by a method (compound representation + ML algorithm) for a given compound that have no experimental information in ChEMBL against which to be assessed. This metric was evaluated across all targets for which a valid model (TCM) is available. As for the recovery rate, a prediction outside a model's AD was labeled as unknown. So, the difference between the number of targets in VSD3 ($T_{VSD3}$) and the total number of predicted targets inside the AD of the TCM (the sum of TP, FP, TN, and FN) was divided by the number of targets in VSD3 to obtain the unknown rate for each compound (Equation 2). For a particular method, the unknown rate was defined as the average of the metric across all compounds.

$$\text{unknown} = \frac{T_{VSD3} - (TP + FP + TN + FN)}{T_{VSD3}} \quad (2)$$
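A small sketch of the two per-compound metrics, assuming the assessed predictions are the four confusion-matrix outcomes (TP, FP, TN, FN) inside the AD. Function and parameter names are hypothetical and chosen for this example.

```python
from statistics import mean

def recovery_rate(n_predicted_targets, n_chembl_targets):
    """Fraction of ChEMBL targets for which a prediction (inside the AD) was made."""
    return n_predicted_targets / n_chembl_targets

def unknown_rate(n_vsd3_targets, tp, fp, tn, fn):
    """Fraction of VSD3 targets without an assessable prediction for one compound."""
    assessed = tp + fp + tn + fn
    return (n_vsd3_targets - assessed) / n_vsd3_targets

def method_average(per_compound_values):
    """Overall value for a method: the mean across all compounds."""
    return mean(per_compound_values)
```

For example, a compound with predictions for 50 of 200 ChEMBL targets would have a recovery rate of 0.25 under this definition.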

Consensus strategies
Each target-centric model from a web tool (WTCM) (Table SM1.2) had its own way of ranking a compound's target profile, and its target prediction scores had a different scale. Hence, to integrate all the algorithms' predictions in a consensus approach, the scores were transformed to a common scale. First, each WTCM output $x_{j,k}$, which represents the possible interaction between compound $j$ and protein $k$, was normalized according to its output type: 1) Equation 3 was used for algorithms that report probability scores [0-1]; 2) Equation 4 was used for algorithms that report order scores [1-n]; 3) Equation 5 was used for algorithms that report reliability-of-prediction scores [1-n]; and 4) Equation 6 was used for models that report a p-value score [0-1] with a cutoff of 0.05. Then, the consensus score ($z_{j,k}$) for the interaction between compound $j$ and protein $k$ across all algorithms $i$ published in the web tools was computed as in Equation 7.
In Equation 7, $\langle y^{i}_{j,k} \rangle$ was the average of the normalized values obtained from Equations 3, 4, 5, and 6 for a given compound-protein pair, $s_a$ was the number of web tools identifying protein $k$, and $S_t$ was the total number of web tools. The consensus score $z$ ranged from 0 to 1 and represented the reliability of the interaction between a compound under study and a particular target, considering the information across the different algorithms.
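Equation 7 itself is not reproduced in this text, so the following is only a hedged sketch of one form consistent with the description: it assumes the consensus score is the average normalized score multiplied by the fraction of web tools that reported the target. That product form, the function name, and the parameter names are assumptions for illustration, not the authors' confirmed formula.

```python
def consensus_score(normalized_scores, total_tools):
    """Assumed form: z = mean(y) * (s_a / S_t), with every y already scaled to [0, 1].

    normalized_scores: scores from the tools that reported the target (s_a = their count).
    total_tools: total number of web tools considered (S_t).
    """
    if not normalized_scores:
        return 0.0  # no tool reported this target
    average = sum(normalized_scores) / len(normalized_scores)
    coverage = len(normalized_scores) / total_tools
    return average * coverage
```

Under this assumption the score stays within [0, 1], as the text requires, and a target reported by only a few tools is penalized even when those tools score it highly.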
The consensus strategy over WTCM included every algorithm that returned information, which depended on the availability of each web tool. As mentioned previously, the size of the predicted target list (10 to 623 predictions, as shown in Table SM1.2) also differed between algorithms. The consensus score served as a ranking criterion, and a threshold of 0.5 was proposed to classify predictions as positive or negative interactions. Several top-ranked fractions of the ranked list, from 1% to 100% (step size of 5%), were analyzed to evaluate the performance of each fraction. For each fraction of the ranked list, consensus scores over 0.5 were considered positive interactions, while the remaining values were considered negative ones.
Next, the predicted target profiles from the WTCM consensus strategy were sorted in descending order of consensus score. The confusion matrix and the performance metrics [23] TPR, TNR, FPR and FNR were calculated for each fraction of the ranked list. Additionally, the recovery rate and unknown rate were computed as defined in the previous section. Under this scheme, the predictions at the top of the ranked list were regarded as the most accurate (a form of early enrichment, as commonly applied in virtual screening).
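The fraction-wise evaluation can be sketched as follows. The grid of fractions (1%, then 5% to 100% in steps of 5%) is one reading of "from 1% to 100% (step size of 5%)", and rounding the subset size up is a choice made for this example; both are assumptions, as are the function and variable names.

```python
import math

# Assumed grid: 1%, 5%, 10%, ..., 100%.
FRACTIONS = [0.01] + [step / 100 for step in range(5, 101, 5)]

def evaluate_fraction(predictions, fraction, threshold=0.5):
    """Confusion counts for the top `fraction` of predictions ranked by score.

    predictions: list of (score, is_active) pairs, where is_active is the known label.
    """
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    size = max(1, math.ceil(len(ranked) * fraction))  # keep at least one prediction
    tp = fp = tn = fn = 0
    for score, is_active in ranked[:size]:
        predicted_active = score > threshold
        if predicted_active and is_active:
            tp += 1
        elif predicted_active:
            fp += 1
        elif is_active:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn
```

Looping over `FRACTIONS` and converting each tuple of counts into rates gives one performance point per top-ranked fraction.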
In the case of the trained target-centric models (TCM), the predicted target lists per compound were the same size regardless of the modeling method, since the same set of targets was modeled by every method. Therefore, before performing the consensus strategy, an analysis was done to select a representative subset of the 15 models that preserved their diversity (this step was not required for the WTCM because the output target list differs for each algorithm). Diversity among base predictors is one of the most important aspects of fusion strategies, since it avoids bias toward over-represented decision-makers.
The Rand index (RI) was used to cluster the models' binary prediction vectors according to their similarity. The output vectors for the VSD2 dataset were selected for clustering the TCM. The RI is one of the best-known indices for measuring the similarity between two clusterings and has been widely used in various studies [24]. It was used to create a similarity matrix to group the TCM with hierarchical clustering. The RI is bounded between 0 and 1, with 0 indicating that the binary vectors do not agree on any pair of points and 1 indicating that the two clusterings are the same. The hierarchical clustering was built from the matrix of pairwise RI values over the 15 TCM, and three cutoffs were set, yielding three different TCM groups, one per clustering threshold. The performance of the TCM groups was evaluated as described above for the fusion of WTCM.
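As an illustration of the similarity measure, the sketch below computes the Rand index between two prediction vectors by checking, for every pair of compounds, whether both vectors agree on grouping the pair together or apart. It is a direct pairwise implementation written for clarity rather than speed, and the function names are hypothetical.

```python
from itertools import combinations

def rand_index(a, b):
    """Rand index between two equal-length label vectors (needs at least two items)."""
    pairs = list(combinations(range(len(a)), 2))
    # A pair counts as an agreement if both vectors put it in the same group,
    # or both vectors put it in different groups.
    agreements = sum(1 for i, j in pairs if (a[i] == a[j]) == (b[i] == b[j]))
    return agreements / len(pairs)

def similarity_matrix(vectors):
    """Pairwise Rand index over a list of prediction vectors, one per model."""
    return [[rand_index(u, v) for v in vectors] for u in vectors]
```

Because the index compares groupings rather than labels, two exactly complementary binary vectors score 1. A dissimilarity such as 1 - RI could then be passed to any hierarchical clustering routine.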
Then, as for the WTCM consensus, the consensus strategy was applied to the TCM group with the best performance over the compounds in VSD3.
The outputs of all TCM were class membership probabilities $p_{k,i}$ in the range [0, 1], so no normalization was required for their fusion. The mean across the probabilities of the selected models was computed (Equation 8), and the target profile was sorted in descending order of the difference in class probabilities (active probability minus inactive probability). Then, different top-ranked subsets from 1% to 100% (step size of 5%) were also evaluated. In each subset, probability differences over 0 (equivalent to the 0.5 threshold used for the web tool algorithms) were regarded as positive bioactivity, while the remaining values were regarded as negative bioactivity. Results were also reported in terms of TPR, TNR, FPR, FNR, recovery rate, and unknown rate.
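The probability fusion can be sketched in a few lines. This example assumes binary classifiers, so that the inactive probability equals one minus the active probability; that assumption, the input layout (a dictionary from target to a list of active-class probabilities), and the function names are illustrative choices, not details confirmed by the text.

```python
def probability_difference(active_probs):
    """Mean active probability minus mean inactive probability for one target."""
    mean_active = sum(active_probs) / len(active_probs)
    mean_inactive = 1.0 - mean_active   # assumes a binary active/inactive model
    return mean_active - mean_inactive

def rank_targets(profile):
    """Sort targets by descending probability difference and flag positives.

    profile: dict mapping target -> active-class probabilities from the selected models.
    Returns a list of (target, difference, is_positive) tuples.
    """
    scored = [(target, probability_difference(probs)) for target, probs in profile.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(target, diff, diff > 0) for target, diff in scored]
```

A difference above 0 corresponds to a mean active probability above 0.5, which is why the two thresholds are described as equivalent.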

Fig. SM1.2 An example of a distance-based applicability domain (AD) computed over the training set of compounds (orange triangles) before training the target-centric model (TCM). AD analysis helps to determine whether the TCM can be applied to a given set of compounds. Accordingly, the test compounds (green dots) are assessed against the predefined AD. For a trustworthy prediction, compounds outside the AD are not considered.

Target predictions of each web tool were collected with scraping techniques.
1 Score values under 3 represent negative activity.
2 Score values with probability 0 represent an inactive interaction.
3 Score values with a p-value under 0.05 represent an active interaction.
4 It only allows molecule SMILES under 200 characters.
5 It does not provide inactive interactions, only the targets with potential interaction.

Table SM1.1 General physicochemical properties of interest.